Luis Carlos Olivares Rueda
Applied Physicist
In this work we use regression models to predict house prices from features such as the number of bathrooms, the year of construction, and whether the house has a basement.
The dataset was taken from Kaggle and contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa.
Since we are dealing with house prices, we expect the year of construction, the total living area, and the number of rooms and bathrooms to have a strong impact on price. Accordingly, the feature selection and regression analysis should rank these among the most important predictors.
# Import all the necessary libraries
import numpy as np
import pandas as pd
from scipy.stats import norm, skew, kurtosis
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import statsmodels.graphics.gofplots as sm
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import BayesianRidge, LinearRegression, Lars
from sklearn.metrics import mean_squared_error, r2_score
import pycaret.regression as pr
import optuna
import matplotlib.pyplot as plt
import seaborn as sns
import scienceplots
plt.style.use(['notebook', 'grid', 'nature'])
First we are going to define some important functions that we will be using through the project.
# missing_vals calculates the number, percentage and data type of missing values for each predictor of a given pandas DataFrame
def missing_vals(df):
    # Number of missing values
    missing = df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False).values
    # Percentage of missing values
    percentage = (df.isna().mean()*100)[df.isna().mean()*100 > 0].sort_values(ascending=False).values
    # Names of predictors with missing values
    names = df.isna().sum()[df.isna().sum() > 0].sort_values(ascending=False).index
    # Data types of predictors with missing values
    dtypes = df[names].dtypes.values
    # Collect the information into an array
    data = np.array([dtypes, missing, percentage]).T
    # Return it as a pandas DataFrame
    return pd.DataFrame(data=data, index=names, columns=['Dtypes', '#Missing Values', '%Missing Values'])
# skew_kurtosis calculates the skewness and kurtosis of each numeric predictor of a given dataset
def skew_kurtosis(df):
    # Extract numeric features
    numeric_features = df.dtypes[df.dtypes != 'object'].index
    # Calculate skewness and kurtosis
    skewness_vals = df[numeric_features].apply(axis=0, func=lambda x: skew(x)).values
    kurtosis_vals = df[numeric_features].apply(axis=0, func=lambda x: kurtosis(x)).values
    # Collect the information into an array
    data = np.array([skewness_vals, kurtosis_vals]).T
    # Return it as a pandas DataFrame
    return pd.DataFrame(data=data, index=numeric_features, columns=['Skewness', 'Kurtosis'])
# compute_vif calculates the variance inflation factor (VIF) of each predictor of a given dataset
def compute_vif(df, considered_features):
    # Use only the selected features
    X = df[considered_features]
    X = add_constant(X)
    # Calculate the VIF of each predictor and store it in a pandas DataFrame
    vif = pd.DataFrame()
    vif["Variable"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif = vif[vif['Variable'] != 'const']
    return vif.sort_values(by=['VIF'], ascending=False)
# Given a dataset, a pandas Index of predictors and a maximum acceptable VIF (threshold),
# reduce_vif repeatedly drops the predictor with the highest VIF until every value falls below the threshold
def reduce_vif(df, threshold, considered_features):
    # Compute the initial VIF values
    vif = compute_vif(df, considered_features)
    discarded_features = []
    # On each iteration the worst predictor is dropped and the VIFs are recomputed
    while vif.iloc[0, 1] > threshold:
        discarded_features.append(vif.iloc[0, 0])
        vif = compute_vif(df, considered_features.drop(discarded_features))
    return vif
# Import the dataset
df = pd.read_csv("data.csv")
df.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
# Drop the ID column
df.drop(['Id'], axis=1, inplace=True)
df.head()
| MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | Exterior1st | Exterior2nd | MasVnrType | MasVnrArea | ExterQual | ExterCond | Foundation | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinSF1 | BsmtFinType2 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | Heating | HeatingQC | CentralAir | Electrical | 1stFlrSF | 2ndFlrSF | LowQualFinSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | KitchenQual | TotRmsAbvGrd | Functional | Fireplaces | FireplaceQu | GarageType | GarageYrBlt | GarageFinish | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 196.0 | Gd | TA | PConc | Gd | TA | No | GLQ | 706 | Unf | 0 | 150 | 856 | GasA | Ex | Y | SBrkr | 856 | 854 | 0 | 1710 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 8 | Typ | 0 | NaN | Attchd | 2003.0 | RFn | 2 | 548 | TA | TA | Y | 0 | 61 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | Gtl | Veenker | Feedr | Norm | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | Gable | CompShg | MetalSd | MetalSd | None | 0.0 | TA | TA | CBlock | Gd | TA | Gd | ALQ | 978 | Unf | 0 | 284 | 1262 | GasA | Ex | Y | SBrkr | 1262 | 0 | 0 | 1262 | 0 | 1 | 2 | 0 | 3 | 1 | TA | 6 | Typ | 1 | TA | Attchd | 1976.0 | RFn | 2 | 460 | TA | TA | Y | 298 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | Gtl | CollgCr | Norm | Norm | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 162.0 | Gd | TA | PConc | Gd | TA | Mn | GLQ | 486 | Unf | 0 | 434 | 920 | GasA | Ex | Y | SBrkr | 920 | 866 | 0 | 1786 | 1 | 0 | 2 | 1 | 3 | 1 | Gd | 6 | Typ | 1 | TA | Attchd | 2001.0 | RFn | 2 | 608 | TA | TA | Y | 0 | 42 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Crawfor | Norm | Norm | 1Fam | 2Story | 7 | 5 | 1915 | 1970 | Gable | CompShg | Wd Sdng | Wd Shng | None | 0.0 | TA | TA | BrkTil | TA | Gd | No | ALQ | 216 | Unf | 0 | 540 | 756 | GasA | Gd | Y | SBrkr | 961 | 756 | 0 | 1717 | 1 | 0 | 1 | 0 | 3 | 1 | Gd | 7 | Typ | 1 | Gd | Detchd | 1998.0 | Unf | 3 | 642 | TA | TA | Y | 0 | 35 | 272 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | Gtl | NoRidge | Norm | Norm | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | Gable | CompShg | VinylSd | VinylSd | BrkFace | 350.0 | Gd | TA | PConc | Gd | TA | Av | GLQ | 655 | Unf | 0 | 490 | 1145 | GasA | Ex | Y | SBrkr | 1145 | 1053 | 0 | 2198 | 1 | 0 | 2 | 1 | 4 | 1 | Gd | 9 | Typ | 1 | TA | Attchd | 2000.0 | RFn | 3 | 836 | TA | TA | Y | 192 | 84 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
# Create a copy of the data
df1 = df.copy()
First, the number, percentage and data type of missing values are calculated, and a heatmap of the missing values is drawn.
In this problem, most of the missing values do not mean the information is unknown; they mean the house lacks that feature (a pool or a basement, for example). So we need to substitute the NaN values with something more meaningful.
We also rely on the data_description.txt file to decide how to handle each feature.
# missing values table
missing_vals(df1)
| Dtypes | #Missing Values | %Missing Values | |
|---|---|---|---|
| PoolQC | object | 1453 | 99.520548 |
| MiscFeature | object | 1406 | 96.30137 |
| Alley | object | 1369 | 93.767123 |
| Fence | object | 1179 | 80.753425 |
| FireplaceQu | object | 690 | 47.260274 |
| LotFrontage | float64 | 259 | 17.739726 |
| GarageType | object | 81 | 5.547945 |
| GarageYrBlt | float64 | 81 | 5.547945 |
| GarageFinish | object | 81 | 5.547945 |
| GarageQual | object | 81 | 5.547945 |
| GarageCond | object | 81 | 5.547945 |
| BsmtExposure | object | 38 | 2.60274 |
| BsmtFinType2 | object | 38 | 2.60274 |
| BsmtFinType1 | object | 37 | 2.534247 |
| BsmtCond | object | 37 | 2.534247 |
| BsmtQual | object | 37 | 2.534247 |
| MasVnrArea | float64 | 8 | 0.547945 |
| MasVnrType | object | 8 | 0.547945 |
| Electrical | object | 1 | 0.068493 |
# heatmap of the missing values
plt.figure(figsize=(20, 7))
sns.heatmap(df1.isna(), cbar=False)
plt.show()
# Based on the data_description.txt file, the missing values were imputed
# Here MasVnrArea,GarageArea and GarageYrBlt were filled with 0's
fill_zero = ['MasVnrArea', 'GarageArea', 'GarageYrBlt']
df1[fill_zero] = SimpleImputer(strategy='constant', fill_value=0).fit_transform(df1[fill_zero])
# Here MSSubClass, YearBuilt, YearRemodAdd and other year/quality features were cast to the object data type
change_cat = ['MSSubClass', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold', 'OverallQual', 'OverallCond']
df1[change_cat] = df1[change_cat].astype(object)
# Categorical features where NaN means the feature is absent were filled with the string 'None'
fill_none = ['Alley', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'MasVnrType']
df1[fill_none] = SimpleImputer(strategy='constant', fill_value='None').fit_transform(df1[fill_none])
# Electrical has only 1 missing value, so we simply drop that row
delete_rows = ['Electrical']
df1.dropna(axis=0, subset=delete_rows, inplace=True)
# LotFrontage has some missing values; a k-nearest-neighbors imputer with n_neighbors=5 fills them in
fill_num = ['LotFrontage']
knn_imputer = KNNImputer(n_neighbors=5)
df1[fill_num] = knn_imputer.fit_transform(df1[fill_num])
# missing values table
missing_vals(df1)
(The table is now empty: no missing values remain.)
# Create a copy of the data
df2 = df1.copy()
Here we create new variables that could be useful in the regression analysis.
# Square feet per room
df2["SqFtPerRoom"] = df2["GrLivArea"] / (df2["TotRmsAbvGrd"] + df2["FullBath"] + df2["HalfBath"] + df2["KitchenAbvGr"])
# Total Home Quality
df2['Total_Home_Quality'] = df2['OverallQual'] + df2['OverallCond']
# Total Bathrooms
df2['Total_Bathrooms'] = (df2['FullBath'] + (0.5*df2['HalfBath']) + df2['BsmtFullBath'] + (0.5*df2['BsmtHalfBath']))
# HighQualSF
df2["HighQualSF"] = df2["1stFlrSF"] + df2["2ndFlrSF"]
# Create a copy of the data
df3 = df2.copy()
Several regression methods work better if the data is normalized. Here we calculate the skewness and kurtosis of the target (SalePrice) and plot its distribution. As you will see, the original data is right-skewed (positive skew), so we normalize it using the Yeo-Johnson transformation.
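For reference, the Yeo-Johnson transformation estimated by PowerTransformer is

$$
\psi(y,\lambda)=
\begin{cases}
\big((y+1)^{\lambda}-1\big)/\lambda, & \lambda\neq 0,\ y\geq 0,\\
\ln(y+1), & \lambda=0,\ y\geq 0,\\
-\big((1-y)^{2-\lambda}-1\big)/(2-\lambda), & \lambda\neq 2,\ y<0,\\
-\ln(1-y), & \lambda=2,\ y<0,
\end{cases}
$$

where $\lambda$ is chosen by maximum likelihood. Since SalePrice is strictly positive, only the first two branches apply here.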
skew_kurtosis(df3[['SalePrice']])
| Skewness | Kurtosis | |
|---|---|---|
| SalePrice | 1.880008 | 6.502799 |
# Histogram and Normal Probability Plot of the target (House Pricing)
fig, (ax1, ax2) = plt.subplots(ncols=2, nrows=1, figsize=(20, 7))
sns.histplot(df3['SalePrice'], stat='density', color='orange', ax=ax1)
mu, std = norm.fit(df3['SalePrice'])
xx = np.linspace(*ax1.get_xlim(),100)
ax1.set_title('Sales Price Distribution')
sns.lineplot(x=xx, y=norm.pdf(xx, mu, std), ax=ax1)
sm.ProbPlot(df3['SalePrice']).qqplot(line='s', ax=ax2)
ax2.set_title('Normal Probability Plot of Sales Price')
plt.show()
# Using yeo-johnson transformation on the target
target_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
# Histogram and Normal Probability Plot of the transformed target (Normalized Target)
df3['Transformed_SalePrice'] = target_transformer.fit_transform(df3[['SalePrice']]).T[0]
fig, (ax1, ax2) = plt.subplots(ncols=2, nrows=1, figsize=(20, 7))
sns.histplot(df3['Transformed_SalePrice'], stat='density', color='orange', ax=ax1)
mu, std = norm.fit(df3['Transformed_SalePrice'])
xx = np.linspace(*ax1.get_xlim(),100)
ax1.set_title('Transformed Sales Price Distribution')
sns.lineplot(x=xx, y=norm.pdf(xx, mu, std), ax=ax1)
sm.ProbPlot(df3['Transformed_SalePrice']).qqplot(line='s', ax=ax2)
ax2.set_title('Normal Probability Plot of Transformed Sales Price')
plt.show()
# Drop the original target and create a copy of the data
df3.drop(['SalePrice'], axis=1, inplace=True)
df4 = df3.copy()
Here we calculate the skewness and kurtosis of the predictors and normalize with the Yeo-Johnson transformation only the predictors satisfying $|\mathrm{skew}(x)|<2$ or $|\mathrm{kurtosis}(x)|<7$.
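For reference, scipy's skew and kurtosis report the sample skewness and the excess kurtosis,

$$
g_1=\frac{m_3}{m_2^{3/2}},\qquad g_2=\frac{m_4}{m_2^{2}}-3,\qquad m_k=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^{k},
$$

so a normal distribution has $g_1=g_2=0$, which is why some predictors below show negative kurtosis.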
# Skewness and kurtosis of the predictors
skew_kurtosis(df4.drop(['Transformed_SalePrice'], axis=1))
| Skewness | Kurtosis | |
|---|---|---|
| LotFrontage | 2.382060 | 21.754015 |
| LotArea | 12.190881 | 202.402120 |
| MasVnrArea | 2.673798 | 10.095230 |
| BsmtFinSF1 | 1.683465 | 11.079615 |
| BsmtFinSF2 | 4.249219 | 20.023898 |
| BsmtUnfSF | 0.918367 | 0.466639 |
| TotalBsmtSF | 1.525190 | 13.232154 |
| 1stFlrSF | 1.375089 | 5.724629 |
| 2ndFlrSF | 0.813466 | -0.554484 |
| LowQualFinSF | 8.998885 | 82.885802 |
| GrLivArea | 1.364297 | 4.868582 |
| BsmtFullBath | 0.594354 | -0.841470 |
| BsmtHalfBath | 4.097541 | 16.322022 |
| FullBath | 0.037821 | -0.858040 |
| HalfBath | 0.677275 | -1.073973 |
| BedroomAbvGr | 0.211839 | 2.215847 |
| KitchenAbvGr | 4.482026 | 21.436776 |
| TotRmsAbvGrd | 0.676068 | 0.872000 |
| Fireplaces | 0.647913 | -0.221309 |
| GarageCars | -0.341494 | 0.214062 |
| GarageArea | 0.179081 | 0.907592 |
| WoodDeckSF | 1.539362 | 2.974720 |
| OpenPorchSF | 2.361099 | 8.452397 |
| EnclosedPorch | 3.085342 | 10.381118 |
| 3SsnPorch | 10.290132 | 123.147774 |
| ScreenPorch | 4.116334 | 18.356321 |
| PoolArea | 14.807992 | 222.344724 |
| MiscVal | 24.443278 | 698.121807 |
| SqFtPerRoom | 0.980318 | 2.875496 |
| Total_Bathrooms | 0.265074 | -0.138523 |
| HighQualSF | 1.328266 | 4.853191 |
# Find the predictors with (abs(skew(x)) < 2) or (abs(kurtosis(x)) < 7)
skewed_values = skew_kurtosis(df4.drop(['Transformed_SalePrice'], axis=1))
threshold = (np.abs(skewed_values['Skewness']) < 2) | (np.abs(skewed_values['Kurtosis']) < 7)
skewed_values[threshold]
| Skewness | Kurtosis | |
|---|---|---|
| BsmtFinSF1 | 1.683465 | 11.079615 |
| BsmtUnfSF | 0.918367 | 0.466639 |
| TotalBsmtSF | 1.525190 | 13.232154 |
| 1stFlrSF | 1.375089 | 5.724629 |
| 2ndFlrSF | 0.813466 | -0.554484 |
| GrLivArea | 1.364297 | 4.868582 |
| BsmtFullBath | 0.594354 | -0.841470 |
| FullBath | 0.037821 | -0.858040 |
| HalfBath | 0.677275 | -1.073973 |
| BedroomAbvGr | 0.211839 | 2.215847 |
| TotRmsAbvGrd | 0.676068 | 0.872000 |
| Fireplaces | 0.647913 | -0.221309 |
| GarageCars | -0.341494 | 0.214062 |
| GarageArea | 0.179081 | 0.907592 |
| WoodDeckSF | 1.539362 | 2.974720 |
| SqFtPerRoom | 0.980318 | 2.875496 |
| Total_Bathrooms | 0.265074 | -0.138523 |
| HighQualSF | 1.328266 | 4.853191 |
# Transformation of the predictors with (abs(skew(x)) < 2) or (abs(kurtosis(x)) < 7)
skewed_features = skewed_values[threshold].index
skewed_features
parameter_transformer = PowerTransformer(method='yeo-johnson', standardize=False)
df4[skewed_features] = parameter_transformer.fit_transform(df4[skewed_features])
# Create a copy of the data
df5 = df4.copy()
Categorical variables cannot be used as-is in regression models, so we need to encode them as numerical values.
Some categorical variables have an order that matters; for these we use an ordinal encoder.
The data_description.txt file is used to decide which predictors to encode with the ordinal encoder.
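One caveat: by default OrdinalEncoder assigns integers in lexicographic order of the category strings, which need not match the quality order described in data_description.txt. A minimal sketch of passing the intended order explicitly, assuming the usual Po < Fa < TA < Gd < Ex quality scale (the column and category list here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Quality scale from worst to best ('None' = feature absent); check data_description.txt
quality_order = ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']

toy = pd.DataFrame({'ExterQual': ['TA', 'Gd', 'Ex', 'Fa']})
enc = OrdinalEncoder(categories=[quality_order])
toy['ExterQual_enc'] = enc.fit_transform(toy[['ExterQual']])
# 'TA' -> 3.0, 'Gd' -> 4.0, 'Ex' -> 5.0, 'Fa' -> 2.0
```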
# Several predictors were encoded with the OrdinalEncoder() class from sklearn
ordinal_features = ['MSSubClass', 'OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC', 'Functional', 'Fence', 'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'YrSold', 'MoSold', 'YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Total_Home_Quality']
ordinal_encoder = OrdinalEncoder()
df5[ordinal_features] = ordinal_encoder.fit_transform(df5[ordinal_features])
# Store all numeric features so they can be standardized later
standardize_features = df5.dtypes[df5.dtypes != 'object'].index
# Exclude the target (Transformed_SalePrice), which is the last numeric column
standardize_features = standardize_features[:-1]
# Create a copy of the data
df6 = df5.copy()
Other categorical variables have no meaningful order; for these we use a one-hot encoder.
Again, the data_description.txt file is used to decide which predictors to one-hot encode.
# OneHotEncoder from sklearn was used to encode several predictors
ohe_features = df6[df6.dtypes[df6.dtypes == 'object'].index].columns
ohe_encoder = OneHotEncoder(sparse=False, drop=None)  # in scikit-learn >= 1.2 the argument is sparse_output=False
ohe_encoded = ohe_encoder.fit_transform(df6[ohe_features])
# Store the names of the new variables created by the encoding
# (recent scikit-learn versions provide ohe_encoder.get_feature_names_out() for this)
ohe_categories = []
counter = 0
for i in ohe_encoder.categories_:
    for j in i:
        counter += 1
        ohe_categories.append(j + str(counter))
# Original variables before encoding were deleted
df6.drop(ohe_features, axis=1, inplace=True)
other_features = df6.columns.values
# One-hot encoded data was concatenated with the rest of the data
concatenated_data = np.concatenate((df6.values, ohe_encoded), axis=1)
transformed_data = pd.DataFrame(data=concatenated_data, columns=[*other_features, *ohe_categories])
# Create a copy of the data
df7 = transformed_data.copy()
Several predictors need to be standardized; this helps both the data analysis and the subsequent regression modeling.
# Standardization was performed with the StandardScaler function from sklearn
standard_scaler = StandardScaler()
df7[standardize_features] = standard_scaler.fit_transform(df7[standardize_features])
# Create a copy of the data
df8 = df7.copy()
Not all predictors will help the regression analysis; some may even cause problems.
First, we delete all variables that are not correlated with the target (house prices).
Then we delete predictors that depend on other predictors, to avoid multicollinearity.
Predictors must be related to the target; unrelated ones can harm the regression analysis and other techniques.
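The Pearson coefficient computed below is

$$
r_{xy}=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i}(y_i-\bar{y})^{2}}},
$$

which ranges from $-1$ (perfect negative linear relation) to $1$ (perfect positive linear relation).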
# Pearson correlation coefficient was used to determine the correlation between the predictors and the target (House-Prices)
corr = df8.corr(method='pearson')[['Transformed_SalePrice']].sort_values(by=['Transformed_SalePrice'], ascending=False)
The selected features satisfy $0.4 \leq |r_{\mathrm{Pearson}}(x)| \leq 1$.
# Selection of the correlated features with the target
selected_features_1 = corr[(np.abs(corr['Transformed_SalePrice']) <= 1) & (np.abs(corr['Transformed_SalePrice']) >= 0.4)]
features_1 = selected_features_1.index[1:]
selected_features_1
| Transformed_SalePrice | |
|---|---|
| Transformed_SalePrice | 1.000000 |
| OverallQual | 0.815235 |
| HighQualSF | 0.736232 |
| GrLivArea | 0.729388 |
| GarageCars | 0.683783 |
| Total_Bathrooms | 0.676208 |
| GarageArea | 0.647242 |
| Total_Home_Quality | 0.645402 |
| TotalBsmtSF | 0.611456 |
| 1stFlrSF | 0.607305 |
| GarageYrBlt | 0.602061 |
| YearBuilt | 0.599856 |
| SqFtPerRoom | 0.592996 |
| FullBath | 0.592231 |
| YearRemodAdd | 0.567088 |
| TotRmsAbvGrd | 0.538937 |
| PConc123 | 0.530161 |
| Fireplaces | 0.511809 |
| MasVnrArea | 0.421457 |
| Attchd139 | 0.419819 |
| GarageFinish | -0.414034 |
| HeatingQC | -0.425112 |
| KitchenQual | -0.526739 |
| BsmtQual | -0.572155 |
| ExterQual | -0.574202 |
# Create a copy of the data
df9 = df8[selected_features_1.index].copy()
# Correlation Heatmap of selected variables
plt.figure(figsize=(20, 7))
sns.heatmap(df9[features_1].corr())
plt.show()
Here we delete the predictors that depend on other predictors: we compute the VIF values of the data and iteratively drop the variable with the highest VIF until all values fall below the threshold.
The maximum VIF value allowed is 5.
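Recall that the VIF of predictor $j$ comes from regressing it on all the other predictors:

$$
\mathrm{VIF}_j=\frac{1}{1-R_j^{2}},
$$

so the threshold of 5 corresponds to $R_j^{2}=0.8$, i.e. a predictor is dropped when 80% of its variance is explained by the others.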
# VIF of the last data processed
compute_vif(df9, features_1)
| Variable | VIF | |
|---|---|---|
| 3 | GrLivArea | 172.929295 |
| 2 | HighQualSF | 111.020176 |
| 15 | TotRmsAbvGrd | 22.159986 |
| 12 | SqFtPerRoom | 18.155202 |
| 1 | OverallQual | 7.166322 |
| 6 | GarageArea | 6.526810 |
| 4 | GarageCars | 5.715406 |
| 11 | YearBuilt | 5.358260 |
| 10 | GarageYrBlt | 5.153571 |
| 7 | Total_Home_Quality | 3.707992 |
| 9 | 1stFlrSF | 3.407825 |
| 13 | FullBath | 3.311463 |
| 5 | Total_Bathrooms | 2.881309 |
| 8 | TotalBsmtSF | 2.805675 |
| 16 | PConc123 | 2.627893 |
| 14 | YearRemodAdd | 2.562225 |
| 24 | ExterQual | 2.399325 |
| 23 | BsmtQual | 2.281713 |
| 22 | KitchenQual | 1.963198 |
| 21 | HeatingQC | 1.558463 |
| 20 | GarageFinish | 1.540274 |
| 19 | Attchd139 | 1.531702 |
| 17 | Fireplaces | 1.518575 |
| 18 | MasVnrArea | 1.359740 |
# The maximum VIF value allowed is 5
selected_features_2 = reduce_vif(df9, 5, features_1)
features_2 = selected_features_2['Variable'].to_list()
selected_features_2
| Variable | VIF | |
|---|---|---|
| 7 | YearBuilt | 4.748294 |
| 6 | GarageYrBlt | 4.138371 |
| 5 | 1stFlrSF | 3.138479 |
| 4 | TotalBsmtSF | 2.702933 |
| 1 | GarageCars | 2.681481 |
| 9 | FullBath | 2.556512 |
| 12 | PConc123 | 2.537507 |
| 2 | Total_Bathrooms | 2.488452 |
| 10 | YearRemodAdd | 2.407694 |
| 20 | ExterQual | 2.336097 |
| 19 | BsmtQual | 2.227776 |
| 18 | KitchenQual | 1.946509 |
| 11 | TotRmsAbvGrd | 1.910599 |
| 8 | SqFtPerRoom | 1.803938 |
| 3 | Total_Home_Quality | 1.774004 |
| 17 | HeatingQC | 1.546306 |
| 15 | Attchd139 | 1.528695 |
| 16 | GarageFinish | 1.504329 |
| 13 | Fireplaces | 1.498017 |
| 14 | MasVnrArea | 1.344485 |
# Heatmap of correlation values
plt.figure(figsize=(20, 7))
sns.heatmap(df9[features_2].corr())
plt.show()
# Add the target back and create a copy of the data
features_2.append('Transformed_SalePrice')
df10 = df9[features_2].copy()
Here we use an automatic outlier-removal method, sklearn's LocalOutlierFactor(), which compares distances between points to decide whether a sample is an outlier (it is based on k-nearest neighbors).
The number of neighbors (points compared against) chosen is 20.
The detector assigns each sample a negative outlier factor; we keep only the samples with a value greater than or equal to -1.4.
# LocalOutlierFactor was fitted and the negative outlier factors were stored in the dataset
outlier_detector = LocalOutlierFactor(n_neighbors=20)
outlier_detector.fit_predict(df10)
df10['NOF'] = outlier_detector.negative_outlier_factor_
# Only the rows with NOF greater than or equal to -1.4 were kept
print('Original Shape', df10.shape)
df10 = df10[df10['NOF'] >= -1.4]
df10.drop(['NOF'], axis=1, inplace=True)
print('New Shape', df10.shape)
Original Shape (1459, 22) New Shape (1387, 21)
# Create a copy of the data
df11 = df10.copy()
50% of the data was used for training the regression models, 25% for testing and 25% for validation.
# train_test_split function from sklearn is used to split the dataset into train, test and validation sets
X = df11.drop(['Transformed_SalePrice'], axis=1)
y = df11['Transformed_SalePrice']
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.5, random_state=86987)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=12345)
Many regression techniques exist and can be found in books, papers, and online, but some libraries make it easy to compare a large number of models at once. For model selection we use the PyCaret library.
# Configuration of pycaret environment
_ = pr.setup(data=df11, target='Transformed_SalePrice', session_id=12345)
| Description | Value | |
|---|---|---|
| 0 | session_id | 12345 |
| 1 | Target | Transformed_SalePrice |
| 2 | Original Data | (1387, 21) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 18 |
| 5 | Categorical Features | 2 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method | None |
| 9 | Transformed Train Set | (970, 20) |
| 10 | Transformed Test Set | (417, 20) |
| 11 | Shuffle Train-Test | True |
| 12 | Stratify Train-Test | False |
| 13 | Fold Generator | KFold |
| 14 | Fold Number | 10 |
| 15 | CPU Jobs | -1 |
| 16 | Use GPU | False |
| 17 | Log Experiment | False |
| 18 | Experiment Name | reg-default-name |
| 19 | USI | 42a5 |
| 20 | Imputation Type | simple |
| 21 | Iterative Imputation Iteration | None |
| 22 | Numeric Imputer | mean |
| 23 | Iterative Imputation Numeric Model | None |
| 24 | Categorical Imputer | constant |
| 25 | Iterative Imputation Categorical Model | None |
| 26 | Unknown Categoricals Handling | least_frequent |
| 27 | Normalize | False |
| 28 | Normalize Method | None |
| 29 | Transformation | False |
| 30 | Transformation Method | None |
| 31 | PCA | False |
| 32 | PCA Method | None |
| 33 | PCA Components | None |
| 34 | Ignore Low Variance | False |
| 35 | Combine Rare Levels | False |
| 36 | Rare Level Threshold | None |
| 37 | Numeric Binning | False |
| 38 | Remove Outliers | False |
| 39 | Outliers Threshold | None |
| 40 | Remove Multicollinearity | False |
| 41 | Multicollinearity Threshold | None |
| 42 | Remove Perfect Collinearity | True |
| 43 | Clustering | False |
| 44 | Clustering Iteration | None |
| 45 | Polynomial Features | False |
| 46 | Polynomial Degree | None |
| 47 | Trignometry Features | False |
| 48 | Polynomial Threshold | None |
| 49 | Group Features | False |
| 50 | Feature Selection | False |
| 51 | Feature Selection Method | classic |
| 52 | Features Selection Threshold | None |
| 53 | Feature Interaction | False |
| 54 | Feature Ratio | False |
| 55 | Interaction Threshold | None |
| 56 | Transform Target | False |
| 57 | Transform Target Method | box-cox |
In the comparison below, Bayesian ridge regression and simple linear regression are the two best methods, so we will use them in the regression analysis.
# Search for the best regression models
top3 = pr.compare_models(n_select=3)
| Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) | |
|---|---|---|---|---|---|---|---|---|
| br | Bayesian Ridge | 0.0385 | 0.0030 | 0.0534 | 0.8724 | 0.0061 | 0.0049 | 0.0100 |
| lr | Linear Regression | 0.0385 | 0.0030 | 0.0534 | 0.8723 | 0.0061 | 0.0049 | 0.7910 |
| ridge | Ridge Regression | 0.0385 | 0.0030 | 0.0534 | 0.8723 | 0.0061 | 0.0049 | 0.0060 |
| lar | Least Angle Regression | 0.0385 | 0.0030 | 0.0534 | 0.8723 | 0.0061 | 0.0049 | 0.0060 |
| huber | Huber Regressor | 0.0382 | 0.0030 | 0.0536 | 0.8718 | 0.0061 | 0.0049 | 0.0110 |
| catboost | CatBoost Regressor | 0.0383 | 0.0030 | 0.0537 | 0.8707 | 0.0061 | 0.0049 | 0.9270 |
| gbr | Gradient Boosting Regressor | 0.0391 | 0.0031 | 0.0552 | 0.8637 | 0.0063 | 0.0050 | 0.0500 |
| lightgbm | Light Gradient Boosting Machine | 0.0406 | 0.0032 | 0.0555 | 0.8624 | 0.0063 | 0.0052 | 0.0410 |
| rf | Random Forest Regressor | 0.0408 | 0.0034 | 0.0576 | 0.8522 | 0.0065 | 0.0052 | 0.1160 |
| xgboost | Extreme Gradient Boosting | 0.0440 | 0.0036 | 0.0594 | 0.8431 | 0.0067 | 0.0056 | 0.1750 |
| et | Extra Trees Regressor | 0.0429 | 0.0037 | 0.0605 | 0.8372 | 0.0069 | 0.0055 | 0.0920 |
| par | Passive Aggressive Regressor | 0.0467 | 0.0038 | 0.0612 | 0.8333 | 0.0069 | 0.0060 | 0.0070 |
| knn | K Neighbors Regressor | 0.0455 | 0.0041 | 0.0632 | 0.8223 | 0.0072 | 0.0058 | 0.0110 |
| ada | AdaBoost Regressor | 0.0532 | 0.0049 | 0.0700 | 0.7823 | 0.0079 | 0.0068 | 0.0440 |
| dt | Decision Tree Regressor | 0.0620 | 0.0072 | 0.0844 | 0.6836 | 0.0096 | 0.0079 | 0.0070 |
| omp | Orthogonal Matching Pursuit | 0.0700 | 0.0083 | 0.0906 | 0.6352 | 0.0103 | 0.0089 | 0.0050 |
| lasso | Lasso Regression | 0.1180 | 0.0229 | 0.1511 | -0.0063 | 0.0171 | 0.0150 | 0.3040 |
| en | Elastic Net | 0.1180 | 0.0229 | 0.1511 | -0.0063 | 0.0171 | 0.0150 | 0.0050 |
| llar | Lasso Least Angle Regression | 0.1180 | 0.0229 | 0.1511 | -0.0063 | 0.0171 | 0.0150 | 0.0110 |
| dummy | Dummy Regressor | 0.1180 | 0.0229 | 0.1511 | -0.0063 | 0.0171 | 0.0150 | 0.0130 |
top3
[BayesianRidge(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
compute_score=False, copy_X=True, fit_intercept=True,
lambda_1=1e-06, lambda_2=1e-06, lambda_init=None, n_iter=300,
normalize=False, tol=0.001, verbose=False),
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=-1, normalize=False),
Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=12345, solver='auto', tol=0.001)]
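Each entry in `top3` is an already-fitted scikit-learn estimator, so it can be used directly for prediction. A minimal sketch of this idea, using synthetic data in place of the preprocessed house-price features (the arrays below are illustrative assumptions, not the actual dataset):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic stand-in for the preprocessed feature matrix and target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=100)

# compare_models returns fitted estimators; here we fit one explicitly
model = BayesianRidge().fit(X, y)
preds = model.predict(X)
print(preds.shape)  # -> (100,), one prediction per row
```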
For the hyperparameter optimization we will use the Optuna library.
Simple linear regression has no hyperparameters to tune, so it needs no optimization.
Bayesian ridge regression, on the other hand, does need to be optimized.
# Dictionary to store the name and RMSE of each regressor
scores = {'Regressor': [], 'RMSE': []}
Optuna requires an objective function to minimize. Inside it we define the set of hyperparameters that will be passed to the Bayesian Ridge regressor; the quantity we want to minimize is the RMSE (root mean squared error).
The function trains the regressor on the training set and computes the RMSE on the test set.
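As a quick sanity check of the metric itself, RMSE can be computed by hand on a toy pair of vectors (the numbers below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 4.0])

# RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(round(rmse, 4))  # -> 0.5774, i.e. sqrt(1/3)
```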
# Function to be optimized
def br_optimizer(trial):
    # Hyperparameters Optuna will search over
    alpha_1 = trial.suggest_loguniform('alpha_1', 1e-11, 1e-3)
    alpha_2 = trial.suggest_loguniform('alpha_2', 1e-11, 1e-3)
    lambda_1 = trial.suggest_loguniform('lambda_1', 1e-11, 1e-3)
    lambda_2 = trial.suggest_loguniform('lambda_2', 1e-11, 1e-3)
    # Fixed parameters
    compute_score = False
    fit_intercept = True
    tol = 1e-9
    n_iter = int(1e4)
    parameters = {'alpha_1': alpha_1, 'alpha_2': alpha_2, 'lambda_1': lambda_1, 'lambda_2': lambda_2,
                  'compute_score': compute_score, 'fit_intercept': fit_intercept, 'tol': tol, 'n_iter': n_iter}
    # Train the model
    model = BayesianRidge(**parameters)
    model.fit(X_train, y_train)
    # Predict on the test set and compute the RMSE in the original price scale
    predictions = model.predict(X_test)
    test_score = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_test, predictions))]])[0][0]
    return test_score
# Optimization with Optuna library (100 iterations)
study = optuna.create_study(direction='minimize')
study.optimize(br_optimizer, n_trials=100)
[I 2023-04-10 21:14:36,559] A new study created in memory with name: no-name-af0681b9-663f-4787-90c7-843b026e8af3
[I 2023-04-10 21:14:36,572] Trial 0 finished with value: 0.05382337121625924 and parameters: {'alpha_1': 1.3140110789242623e-08, 'alpha_2': 0.0001457560069024532, 'lambda_1': 1.2228004513933517e-09, 'lambda_2': 2.075416250506302e-07}. Best is trial 0 with value: 0.05382337121625924.
[I 2023-04-10 21:14:36,581] Trial 1 finished with value: 0.053823367166067504 and parameters: {'alpha_1': 0.0006927072229209075, 'alpha_2': 0.0001103546840193058, 'lambda_1': 1.4801871924133962e-05, 'lambda_2': 3.335636002978507e-10}. Best is trial 1 with value: 0.053823367166067504.
[...]
[I 2023-04-10 21:14:37,632] Trial 52 finished with value: 0.05382326858483766 and parameters: {'alpha_1': 1.7326889970694058e-11, 'alpha_2': 0.0009584447367337944, 'lambda_1': 0.0004898774786274983, 'lambda_2': 9.90499411629696e-11}. Best is trial 52 with value: 0.05382326858483766.
[...]
[I 2023-04-10 21:14:38,733] Trial 99 finished with value: 0.053823367948883316 and parameters: {'alpha_1': 4.625521325747993e-11, 'alpha_2': 8.635922913803903e-05, 'lambda_1': 0.0001500491622757453, 'lambda_2': 5.424969168185107e-11}. Best is trial 52 with value: 0.05382326858483766.
# The best parameters found are stored in a dictionary
br_params = study.best_params
br_params['compute_score'] = False
br_params['fit_intercept'] = True
br_params['tol'] = 1e-9
br_params['n_iter'] = int(1e4)
pd.DataFrame(data=br_params.values(), index=br_params.keys(), columns=['Value'])
| Parameter | Value |
|---|---|
| alpha_1 | 0.0 |
| alpha_2 | 0.000958 |
| lambda_1 | 0.00049 |
| lambda_2 | 0.0 |
| compute_score | False |
| fit_intercept | True |
| tol | 0.0 |
| n_iter | 10000 |
# The RMSE of the regressor is shown
scores['Regressor'].append('BayesianRidgeRegression')
scores['RMSE'].append(study.best_value)
pd.DataFrame(data=scores)
| | Regressor | RMSE |
|---|---|---|
| 0 | BayesianRidgeRegression | 0.053823 |
# Graph of the errors during training
optuna.visualization.plot_optimization_history(study)
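The interactive figure plots each trial's objective value together with the best value found so far. If only the raw per-trial values are at hand (as in the log output above), the same running-best curve can be recomputed directly; a minimal sketch using a few of the logged values:

```python
import numpy as np

# A handful of objective values in trial order (taken from the log above, truncated)
trial_values = [0.05382336, 0.05382334, 0.05382331, 0.05382348, 0.05382331]

# Running minimum after each trial, i.e. the "best value" curve
# drawn by optuna.visualization.plot_optimization_history
best_so_far = np.minimum.accumulate(trial_values)
print(best_so_far)
```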
In this section, the Bayesian Ridge Regression and the Simple Linear Regression models are trained on the training dataset, and the RMSE and R2 values are calculated on the train, test and validation sets for each model.
A function was created to simplify the task of training and testing different models.
# This function returns a pandas DataFrame that contains the RMSE and R2 values of a regression model
# The RMSE and R2 values are calculated for the train, test and validation datasets
def train_test_val(model, params, X_train, y_train, X_test, y_test, X_valid, y_valid):
# Training of the model
model = model(**params).fit(X_train, y_train)
# Predictions for the train, test and validation datasets
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
valid_pred = model.predict(X_valid)
# RMSE (mapped back through the previously fitted target_transformer) and R2 for the train, test and validation sets
train_RMSE = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_train, train_pred))]])[0][0]
train_R2 = r2_score(y_train, train_pred)
test_RMSE = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_test, test_pred))]])[0][0]
test_R2 = r2_score(y_test, test_pred)
valid_RMSE = target_transformer.inverse_transform([[np.sqrt(mean_squared_error(y_valid, valid_pred))]])[0][0]
valid_R2 = r2_score(y_valid, valid_pred)
# RMSE and R2 values are stored into a dictionary
scores = {'Data':['Train', 'Test', 'Validation'], 'RMSE':[train_RMSE, test_RMSE, valid_RMSE], 'R2':[train_R2, test_R2, valid_R2]}
# The previous dictionary is transformed into a pandas DataFrame
return pd.DataFrame(data=scores)
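The function assumes the data has already been split into train, test and validation sets. One common way to produce such a three-way split (a sketch with hypothetical proportions, not necessarily the ones used earlier in this notebook) is two chained calls to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data standing in for the preprocessed predictors and target
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)

# First split off 20% of the samples as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then carve a validation set out of the remaining 80% (0.25 * 0.8 = 20% overall)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.25, random_state=42)
print(len(X_train), len(X_test), len(X_valid))  # 60 20 20
```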
br_params
{'alpha_1': 1.7326889970694058e-11,
'alpha_2': 0.0009584447367337944,
'lambda_1': 0.0004898774786274983,
'lambda_2': 9.90499411629696e-11,
'compute_score': False,
'fit_intercept': True,
'tol': 1e-09,
'n_iter': 10000}
With the Bayesian Ridge Regression model we obtain excellent results: the RMSE for the test and validation sets is approximately 0.054 (a very small error), and the R2 is 0.88 and 0.87 for the test and validation sets, respectively.
train_test_val(model=BayesianRidge, params=br_params, X_train=X_train, X_test=X_test, X_valid=X_valid, y_train=y_train, y_test=y_test, y_valid=y_valid)
| | Data | RMSE | R2 |
|---|---|---|---|
| 0 | Train | 0.056251 | 0.876213 |
| 1 | Test | 0.053823 | 0.884018 |
| 2 | Validation | 0.053637 | 0.875380 |
The Simple Linear Regression achieved results very similar to the Bayesian Ridge Regression, which suggests that after the data treatments the dataset is very "simple": a plain linear regression is already enough to obtain good results.
fit_intercept = True
lr_params = {'fit_intercept':fit_intercept}
lr_params
{'fit_intercept': True}
train_test_val(model=LinearRegression, params=lr_params, X_train=X_train, X_test=X_test, X_valid=X_valid, y_train=y_train, y_test=y_test, y_valid=y_valid)
| | Data | RMSE | R2 |
|---|---|---|---|
| 0 | Train | 0.056225 | 0.876323 |
| 1 | Test | 0.053969 | 0.883406 |
| 2 | Validation | 0.053710 | 0.875049 |
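That the two models essentially tie is consistent with ordinary least squares already being near-optimal here. As a reminder, `LinearRegression` solves the normal equations, beta = (X^T X)^{-1} X^T y; a minimal sketch on made-up data showing that the closed form agrees with sklearn:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with a known linear relationship plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Closed-form OLS with an explicit intercept column
Xc = np.column_stack([np.ones(len(X)), X])
beta = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

lr = LinearRegression(fit_intercept=True).fit(X, y)
print(np.allclose(beta[1:], lr.coef_), np.allclose(beta[0], lr.intercept_))  # True True
```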
In this section we store the coefficients obtained by the Bayesian Ridge and the Simple Linear Regression models.
The magnitude of these coefficients tells us how important a feature or predictor is for the target.
# Models are trained again with the train dataset
br = BayesianRidge(**br_params).fit(X_train, y_train)
lr = LinearRegression(**lr_params).fit(X_train, y_train)
# The name of the predictors and their respective regression coefficients are stored into a dictionary
# The absolute value of the coefficients is taken
predictors = X_train.columns.to_numpy()
br_coefs = np.abs(br.coef_)
lr_coefs = np.abs(lr.coef_)
# The coefficients are collected into a pandas DataFrame and sorted by the magnitude of the simple linear regression coefficients
regression_coefficients = pd.DataFrame(data={'predictor':predictors, 'br_coefs':br_coefs, 'lr_coefs':lr_coefs}, columns=['predictor', 'br_coefs', 'lr_coefs'])
regression_coefficients = regression_coefficients.sort_values(by=['lr_coefs'], ascending=False)
regression_coefficients
| | predictor | br_coefs | lr_coefs |
|---|---|---|---|
| 14 | Total_Home_Quality | 0.041338 | 0.042565 |
| 0 | YearBuilt | 0.029025 | 0.031282 |
| 12 | TotRmsAbvGrd | 0.026820 | 0.027686 |
| 7 | Total_Bathrooms | 0.025816 | 0.026337 |
| 13 | SqFtPerRoom | 0.021018 | 0.021320 |
| 4 | GarageCars | 0.019344 | 0.019834 |
| 3 | TotalBsmtSF | 0.016294 | 0.016021 |
| 16 | Attchd139 | 0.013292 | 0.013830 |
| 2 | 1stFlrSF | 0.012858 | 0.012900 |
| 18 | Fireplaces | 0.012819 | 0.012486 |
| 5 | FullBath | 0.010895 | 0.012217 |
| 6 | PConc123 | 0.007808 | 0.008741 |
| 15 | HeatingQC | 0.007811 | 0.007877 |
| 10 | BsmtQual | 0.007110 | 0.007102 |
| 11 | KitchenQual | 0.005320 | 0.005329 |
| 1 | GarageYrBlt | 0.004099 | 0.002574 |
| 19 | MasVnrArea | 0.002594 | 0.002161 |
| 8 | YearRemodAdd | 0.000692 | 0.001615 |
| 9 | ExterQual | 0.000562 | 0.001166 |
| 17 | GarageFinish | 0.000854 | 0.000444 |
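Note that ranking predictors by raw coefficient magnitude is only meaningful because the features were standardized earlier in the pipeline; on unscaled features, a coefficient shrinks as its feature's scale grows. A small sketch with made-up data (two features that contribute equally, one on a much larger scale):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
X[:, 1] *= 1000.0                      # second feature on a much larger scale
y = X[:, 0] + 0.001 * X[:, 1]          # both features matter equally after scaling

# Coefficients on the raw features differ by orders of magnitude,
# while on standardized features they are directly comparable
raw = LinearRegression().fit(X, y).coef_
std = LinearRegression().fit(StandardScaler().fit_transform(X), y).coef_
print(np.abs(raw))
print(np.abs(std))
```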
The following graphs are very similar: barplots showing the magnitude of each regression coefficient next to its predictor name. Remember that the magnitude of these coefficients tells us the impact of a predictor on the target (house prices).
And, as we can expect with basic knowledge about what influences house prices, the five most important features are: Total_Home_Quality, YearBuilt, TotRmsAbvGrd, Total_Bathrooms and SqFtPerRoom.
plt.figure(figsize=(20, 10))
plt.bar(x=regression_coefficients.predictor, height=regression_coefficients.lr_coefs)
plt.xticks(rotation=90)  # long predictor names overlap without rotation
plt.title('Simple Linear Regression Coefficients')
plt.show()
plt.figure(figsize=(20, 10))
plt.bar(x=regression_coefficients.predictor, height=regression_coefficients.br_coefs)
plt.xticks(rotation=90)  # long predictor names overlap without rotation
plt.title('Bayesian Ridge Regression Coefficients')
plt.show()